Ensuring that people and goods are moved safely across Ontario is one of the major objectives of Ontario’s Ministry of Transportation (MTO). To this end, the ministry often conducts surveys of goods, goods value, passengers and vehicular movement across its five regions around the province.
In this project, we examined the survey data collected by the MTO at selected locations across the province of Ontario. According to the metadata, the data spans a two-week period in the year 2008. The aim was to derive a few key insights about vehicular movement in the province by generalizing from this small sample to the state of traffic in Ontario. The sample contains over 17,000 records and 13 attributes; although it covers only two weeks of data, it seems adequate for drawing insights about the population from which it was obtained. Several key questions are examined, in no particular order, in the sections below.
The data contains information about traffic stations, highways, MTO regions, and the week, day and hour of each traffic record. This notebook takes a modular approach, examining the traffic situation first by station, then by highway, MTO region, direction, vehicle type, and by week, hour and day of traffic. Where necessary, combined analyses jointly examining two or more influences on traffic (highways and vehicle types, or traffic at stations on different highways, for example) were also conducted. The insights derived are detailed under each of the headings below.
At the end, some machine learning techniques were also used to shed light on additional details inherent in the data. Clustering was employed to split the stations into groups with different traffic densities. We used the fast Fourier transform (FFT) to carry out some traffic forecasting on the busiest highway in the province; FFT seems well suited to periodic data such as this. Finally, we used logistic regression to predict traffic probabilities on some of the highways. The results of most of the analyses were consistent, showing repeatedly that certain highways carry far more traffic than others. One curious observation is that traffic density and its distribution across the province may not have changed much since 2008, when this data was released.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.display.max_columns=20
%matplotlib inline
Read data into pandas DataFrame and rename columns.
# Read the csv file into notebook as a pandas dataframe
data = pd.read_csv('https://files.ontario.ca/opendata/2006_commercial_vehicle_survey_-_traffic_volumes_at_survey_stations.csv',
header=0, names=['stationID', 'stationName', 'direction', 'MTORegion', 'highway', 'locationDescription',
'dayWeekNo', 'hour', 'single', 'multi', 'auto', 'totalTrucks', 'totalVehicles'])
data.info()
Most of the attribute types are correctly inferred by pandas. The remaining object dtypes are all strings, and their existing form is fine for analysis.
Next we check if the data set contains any null data.
data.isnull().sum()
There are no null values in the data set.
We use the describe function to look at some summary statistics (and the distribution) of the data set. The data has 17,472 records. Comparing the std against the mean of each attribute shows that the spread is considerable. This suggests that the data could be heavily skewed or spread over a large interval, and that it may contain a significant number of outliers. The spread is particularly noticeable because the 75th percentile of every attribute is orders of magnitude less than its maximum value.
data.describe()
data.head()
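The skew that describe() hints at (mean far above the median, maximum far above the 75th percentile) can be quantified directly with `DataFrame.skew()`. A minimal sketch on a small synthetic frame with the same column names (all values hypothetical):

```python
import pandas as pd

# A small synthetic frame with the same vehicle-count columns as the survey
# data (hypothetical values, for illustration only).
toy = pd.DataFrame({
    'single': [1, 2, 2, 3, 40],
    'multi': [0, 1, 1, 2, 80],
    'auto': [5, 7, 9, 11, 500],
    'totalTrucks': [1, 3, 3, 5, 120],
    'totalVehicles': [6, 10, 12, 16, 620],
})

# Fisher-Pearson skewness per column; values well above 0 indicate the
# long right tail that describe() hints at.
skew = toy[['single', 'multi', 'auto', 'totalTrucks', 'totalVehicles']].skew()
print(skew.round(2))
```

On the real frame, `data[['single','multi','auto','totalTrucks','totalVehicles']].skew()` would give the same per-column measure.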
There are 33 unique highway values in the data set, 104 unique stationID values and 5 unique MTORegions.
In addition, it appears that contrary to the information in the data set metadata, there is actually only a week of data available in the data set.
stationID identifies the traffic monitoring station, and stationName is the name of that station. direction gives the direction of traffic, while MTORegion gives one of the five Ontario MTO regions. locationDescription describes the location of the traffic, often specifying the stretch of highway where it was logged. dayWeekNo and hour are the day of week and hour, respectively, during which the record was collected. single, multi, auto and totalTrucks specify the number of single-unit trucks, multi-unit trucks, autos (passenger vehicles) and total trucks (sum of single and multi units). totalVehicles is the sum of auto and truck traffic.
| Field | Field Description |
|---|---|
| Station ID | Unique CVS station number, representing the station from which the data was collected. It takes the form ONXXXX where XXXX is a 4-digit number. There are 104 unique station IDs present in the data set |
| Station Name | Station name; each Station ID carries a station name |
| Direction | One of four traffic directions: east, west, north and south |
| MTO Region | Five regions of MTO (Central, Eastern, Southwestern, Northern, Northwestern) |
| Highway or Road | Highway number or road name. There are 33 unique highways in the data set |
| Location | Description of location |
| Day of Week Number | A number between 1 and 7 representing day of week. 1=Sunday, 7=Saturday |
| Hour | Hour of day, 0 to 23, representing the starting hour (e.g. 12 represents 12 P.M. - 1 P.M.) |
| Single | Number of single-unit trucks |
| Multi | Number of multi-unit trucks |
| Auto | Number of cars and other passenger vehicles |
| Total Trucks | Sum of single and multi-unit trucks |
| Total Vehicles | Number of total vehicles |
print('Number of unique highway values are: {}\n\
Number of unique station values are: {}\nNumber of unique MTO Regions are: {}'
.format(data['highway'].nunique(), data['stationID'].nunique(), data['MTORegion'].nunique()))
data[data.stationID == 'ON0016'].dayWeekNo.unique()
Next, we examine histograms of the quantitative attributes in the data set. The histograms confirm the skewness of the distributions: all are similar in shape, with far more counts at low values than at high values. The vehicle traffic data for every vehicle category is heavily right-skewed, because light traffic occurs on many more days, hours and locations than heavy traffic does, so each histogram congregates on the left side of the chart. A log scale can give better insight into the nature of the distribution.
# Create Facet Grid with seaborn and insert distribution plot for each of the numeric varables
# in the dataset
sns.set(font_scale = 3)
vehicles = data[['single','multi','auto','totalTrucks','totalVehicles']].melt()
g = sns.FacetGrid(vehicles, col="variable", col_wrap= 2, sharey=False, sharex=False, height=10, aspect=2)
# sns.distplot is deprecated in newer seaborn; histplot(kde=True) is its replacement
g = g.map(sns.histplot, "value", kde=True)
plt.savefig('histogram_normal_scale.png')
data1 = data.copy()
# transform data with logarithms
data1 = np.log(data1[['single','multi','auto','totalTrucks','totalVehicles']] + 1)
# Convert data into tidy data format and plot in a facet grid five different distribution plots
# belonging to each numeric column in the data set.
sns.set(font_scale = 3)
melted_data = data1[['single','multi','auto','totalTrucks','totalVehicles']].melt()
g = sns.FacetGrid(melted_data, col="variable", col_wrap= 2, sharey=False, sharex=False, height=10, aspect=2)
# sns.distplot is deprecated in newer seaborn; histplot(kde=True) is its replacement
g = g.map(sns.histplot, "value", kde=True)
plt.savefig('histogram_logscale.png')
The distributions above show that totalVehicles parallels the distribution of auto more closely than that of any other vehicle type, even though both are skewed. This suggests that totalVehicles is composed more of auto than of any other vehicle type, which could mean there are more autos and passenger vehicles in the province than there are trucks. In addition, the totalTrucks histogram is more similar to multi than to single, which suggests there are more multi-unit trucks in the data set, and possibly in Ontario, than single-unit trucks.
However, the difference between all of them does seem significant, and we can confirm this with a statistical test. The Kruskal-Wallis test for a completely randomized design is appropriate here, since the individual groups are not normally distributed.
from scipy.stats import mstats, ranksums
# Compute Kruskal-Wallis statistic to check if there is any difference
# between the different non-normal distributions
single = data['single'].values
multi = data['multi'].values
auto = data['auto'].values
totalTrucks = data['totalTrucks'].values
totalVehicles = data['totalVehicles'].values
print("Kruskal Wallis H-test test:")
H, pval = mstats.kruskalwallis(single, multi, auto, totalTrucks, totalVehicles)
print("H-statistic:", H)
print("P-Value:", pval)
# Follow-up: rank-sum test comparing just multi against totalTrucks
H2, pval2 = ranksums(multi, totalTrucks)
print(H2, pval2)
The p-value is very low, confirming a significant difference between the distributions of the different vehicle types. It doesn't say, however, which group differs from which.
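To identify which groups differ, a common follow-up is pairwise rank-sum tests with a multiple-comparison correction. A minimal sketch, using synthetic exponential stand-ins for three of the vehicle columns (all scales hypothetical) and a Bonferroni correction:

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
# Synthetic stand-ins for the vehicle-count columns (hypothetical scales).
groups = {
    'single': rng.exponential(5, 500),
    'multi': rng.exponential(20, 500),
    'auto': rng.exponential(200, 500),
}

# Pairwise Wilcoxon rank-sum tests; Bonferroni correction controls the
# family-wise error rate across the multiple comparisons.
names = list(groups)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted threshold
for a, b in pairs:
    stat, p = ranksums(groups[a], groups[b])
    verdict = 'different' if p < alpha else 'not distinguishable'
    print(f'{a} vs {b}: p={p:.2e} -> {verdict}')
```

On the real data, the same loop over the actual columns would pinpoint which vehicle types differ.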
def upper_fence(ser):
return (np.quantile(ser, .75) +
1.5 * np.subtract(*np.quantile(ser, [.75,.25])))
# Create 8 subplots: for each numeric variable, the boxplot on the left shows
# the data with outliers; the one on the right excludes points above the upper fence.
fig, ax = plt.subplots(4,2,figsize=(10,12))
i=0
for item in ['single', 'multi', 'auto', 'totalTrucks']:
sns.boxplot(y=data[item], ax=ax[i,0])
sns.boxplot(y=data[item][data[item]<=upper_fence(data[item])-5], ax=ax[i,1])
i += 1
The boxplots confirm the presence of many apparent outliers in the data set. However, since these are valid entries, we will not discard them. Data of this kind should be expected to contain such outliers: on most days and at most locations traffic is light, while at a few congested locations and peak hours it is very heavy.
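To gauge how many points the fences actually flag, we can count values above Tukey's upper fence. A sketch on a synthetic right-skewed series (hypothetical counts), using the same fence definition as above:

```python
import numpy as np
import pandas as pd

def upper_fence(ser):
    # Tukey's rule: Q3 + 1.5 * IQR
    q1, q3 = np.quantile(ser, [0.25, 0.75])
    return q3 + 1.5 * (q3 - q1)

# Synthetic right-skewed counts standing in for a vehicle column.
rng = np.random.default_rng(1)
counts = pd.Series(rng.exponential(100, 1000).round())

n_outliers = (counts > upper_fence(counts)).sum()
print(f'{n_outliers} of {len(counts)} points lie above the upper fence')
```

For a right-skewed distribution like this, a few percent of points typically land above the fence even though they are perfectly valid observations.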
# Determine the number of unique variables in the station column
data.stationID.nunique()
There are only 104 unique stations in the data set. Some cover several highways, while others are restricted to a single highway. We will look closely at some of the more important stations.
First, we examine which stations recorded the heaviest traffic.
# Group the data set by stationID and determine the number of different vehicle types present
data.groupby('stationID').mean(numeric_only=True).sort_values(by=['totalVehicles','auto'], ascending=False)[:10]
Stations ON0115, ON0116 and ON0140 recorded the heaviest traffic in Ontario over this period. Raw counts may not show how significant the traffic recorded at these stations is, so it is worth examining the percentages.
# Group the dataset by stationID and determine the mean values of the different vehicle counts, sort by totalvehicles first,
# and auto second. The count as a fraction of total vehicles is obtained for each column.
stationsTraffic = data.groupby('stationID').mean(numeric_only=True).sort_values(by=['totalVehicles','auto'], ascending=False)
stationsTraffic['trafficPercentage'] = stationsTraffic['totalVehicles'] * 100 / stationsTraffic['totalVehicles'].sum()
stationsTraffic[:10]
The percentages show that the three stations ON0115, ON0116 and ON0140 together contribute close to 30% of the traffic recorded in all of Ontario. That's pretty significant. We will see later that these three stations are located on a single highway!
sns.set(font_scale = 1)
ax = stationsTraffic[:40]['trafficPercentage'].plot(kind='bar', figsize=(15,6))
ax.set_xticklabels(stationsTraffic[:40].index,rotation=65)
ax.set_title('Percentage of Traffic per Station', fontsize=20)
ax.set_ylabel('Percentage', fontsize=15)
ax.set_xlabel('Station', fontsize=15)
plt.savefig('Percentage_Traffic_per_Station.png')
plt.show()
The contributions from the first three stations are indeed rather significant as can be seen from the bar chart above.
Examining just the first few stations, we can establish that auto contributes the most traffic of all vehicle types. There are clearly more passenger vehicles in Ontario than there are trucks or other kinds of vehicles. totalTrucks represents the sum of single and multi.
ax = stationsTraffic[['single',
'multi',
'auto']][:10].plot(kind='bar', figsize=(15,8))
ax.set_xticklabels(stationsTraffic[:10].index,rotation=65)
ax.set_title('Vehicle Count per Station', fontsize=20)
ax.set_ylabel('Vehicle Count', fontsize=15)
ax.set_xlabel('Station', fontsize=15)
ax.legend(fontsize=15)
plt.savefig('Vehicle_Count_per_Station.png')
plt.show()
Most stations show heavier auto traffic than any other vehicle type. ON0115 and ON0116 have some of the heaviest auto traffic, while station ON0033 appears to have fairly significant truck traffic.
We could also examine the correlation between stations, i.e. the cross correlation between the respective time series for each station. Since there are 104 individual stations, we obtain a 104 x 104 correlation matrix showing which station is correlated with which. This tells us which stations share similar traffic patterns: the patterns may differ in scale, but correlated stations rise and fall together.
# Obtain pairwise correlation of the station data.
# Each station is compared to another in a pairwise fashion and a 104x104 matrix is produced
# There are 104 unique stations in the data set.
xcorr_values = pd.DataFrame()
for i in data.stationID.unique():
for j in data.stationID.unique():
xcorr_values.loc[i,j] = np.corrcoef(data['totalVehicles'][data['stationID'] == i],
data['totalVehicles'][data['stationID'] == j]
)[0,1]
xcorr_values.iloc[:10,:10]
A slice of the matrix shows that most of the station time series correlate well with one another. With average correlations above 70%, it appears traffic rises and falls at much the same rate all over the province. The time plots below seem to confirm this for at least the first few stations.
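The nested-loop computation above can be done in a single call by pivoting the data into a (time x station) matrix and using `DataFrame.corr()`. A sketch with three hypothetical stations sharing a daily cycle (synthetic data, station IDs invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
hours = np.arange(168)                       # one week of hourly records
cycle = np.sin(2 * np.pi * (hours % 24) / 24)

# Wide (time x station) matrix: two stations share the daily cycle at
# different scales, the third is pure noise.
wide = pd.DataFrame({
    'ON0001': 100 * cycle + rng.normal(0, 5, 168),
    'ON0002': 80 * cycle + rng.normal(0, 5, 168),
    'ON0003': rng.normal(0, 5, 168),
})

xcorr = wide.corr()    # pairwise Pearson correlation, station x station
print(xcorr.round(2))
```

On the real frame this would be something like `data.pivot_table(index=['dayWeekNo', 'hour'], columns='stationID', values='totalVehicles').corr()` (assuming the column names used above), which avoids the 104 x 104 Python loop entirely.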
# Sort the correlation values corresponding to station ON0115 in descending order
xcorr_values.loc['ON0115'].sort_values(ascending=False)[:10]
From the cross correlation matrix obtained, we can examine some of the other stations whose traffic correlates strongly with station ON0115. Ranking these values in descending order shows that ON0026 and ON0027 are among the stations whose traffic correlates most highly with traffic at ON0115. We will examine the highways at these stations in later cells while examining highways.
Now, we examine the pattern of traffic at different stations. The typical signature shows a large rise in auto traffic from the beginning of the day, peaking at about noon before settling back to near zero towards midnight. This pattern repeats every day of the week with small differences. Some stations experience heavier traffic than others; some show larger peaks at the start of the working week, when people rush to work, and smaller peaks for the rest of the week. totalTrucks traffic is mild at some stations and heavy at others.
With a uniform y-axis, we can visually compare the scale of traffic at these different locations, and can immediately establish, for example, that traffic at ON0021 is heavier than at ON0017, or that stations ON0019 and ON0021 exhibit curious peaks in the morning and afternoon hours respectively.
sns.set(font_scale = 1)
fig, ax = plt.subplots(3, 2, figsize=(11, 10))
for i, station in enumerate(data['stationID'].unique()[:6]):
    a = ax[i//2, i%2]
    a.plot(data[['single', 'multi', 'auto', 'totalTrucks']][data['stationID'] == station])
    a.set_title(station)
    a.set_ylim(0, 5500)
    a.legend(['single', 'multi', 'auto', 'totalTrucks'])
plt.tight_layout()
This view above offers some insight into the way traffic is distributed across the highways. We see that station ON0021 houses a particularly curious peak during afternoon hours each working day.
The plot is reproduced below to get a closer view.
data[data.stationID == 'ON0021'][25:50].reset_index()[['auto']].plot(figsize=(7,5))
plt.title('Traffic recorded at ON0021 on a Tuesday', fontsize=17)
plt.xlabel('Time (hour index)', fontsize=15)
plt.ylabel('Auto Traffic', fontsize=15)
None
This station is located on Hwy 400 between King Rd & Kettleby Rd, north of Vaughan. The map below shows the location of this station. The spike looks to be the daily homeward traffic from downtown Toronto in the northern direction, as seen in the information displayed below. A driver who wants to avoid this traffic would perhaps seek an alternative route, as it appears to be a daily occurrence, or leave the office just as the sharp peak is settling, after 1500 hrs.
print('ON0021 is located on {} and carries traffic in the {} direction'.format(
data[data.stationID == 'ON0021'].locationDescription.unique(), data[data.stationID == 'ON0021'].direction.unique()))
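One way to pin down when to leave is to locate the hour of the daily peak. A minimal sketch on a synthetic 24-hour profile (hypothetical counts) shaped like the ON0021 afternoon spike:

```python
import pandas as pd

# Synthetic hourly 'auto' profile (hypothetical counts) mimicking the
# sharp afternoon spike seen at ON0021: low overnight, burst around 15:00.
profile = pd.Series(
    [50] * 6 + [300, 600, 500, 400, 400, 450, 500, 600, 800,
                2500, 2200, 1200, 700, 400, 250, 150, 100, 60],
    index=range(24), name='auto')

peak_hour = profile.idxmax()
print(f'Traffic peaks at {peak_hour:02d}:00 with {profile.max()} vehicles')
```

On the real frame this could be, e.g., `data[data.stationID == 'ON0021'].groupby('hour')['auto'].mean().idxmax()` (assuming the column names used in this notebook).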
#Install folium to examine maps
!pip install folium
# Run cell to see map of station location!
import folium
from folium.plugins import MarkerCluster
King_to_Kettleby = pd.read_csv('''./mygeodata/On_Highway_400_between_King_Rd_York_Rd_11_Exit_43_Kettleby_Rd_Exit_52_about_2_km_north_of_King_Rd_Km_Marker_45.csv''')
#Add markers to the map
King = [King_to_Kettleby.loc[0, 'Y'], King_to_Kettleby.loc[0,'X']]
Kettleby = [King_to_Kettleby.loc[1, 'Y'], King_to_Kettleby.loc[1,'X']]
#Create the map
my_map = folium.Map(location = King, zoom_start = 11)
folium.Marker(King, popup = 'King Rd').add_to(my_map)
folium.Marker(Kettleby, popup = 'Kettleby Rd').add_to(my_map)
#Display the map
my_map.save('King_to_Kettleby.html')
my_map
Traffic in this direction exhibits repeated peaks every afternoon during the week according to information present in the data set.
data[data.stationID == 'ON0115'].highway[:4]
We confirmed earlier that the biggest daily traffic is registered by station ON0115. It appears this station is located on Hwy 401. We will now look closely at the respective highways in the data set.
First, we visualize the nature of traffic in some of the stations which cover Hwy 401. By paying attention to the y-scale, we see that ON0115, ON0116, ON0139, and ON0140 appear to carry rather heavy traffic compared to others as shown in the plots below.
fig, ax = plt.subplots(7,2, figsize=(10,9), gridspec_kw={'wspace': 0.3, 'hspace': 1.0})
for i,j in enumerate(data[data['highway'] == 'Hwy 401']['stationID'].unique()):
ax[i//2, i%2].plot(data[['single',
'multi',
'auto',
'totalTrucks']][(data['highway'] == 'Hwy 401') & (data['stationID'] == j)],
label=j)
ax[i//2, i%2].set_title(j, fontsize=13)
plt.savefig('Traffic_per_Station_Hwy401.png')
None
This initial analysis confirms what we already know; that Hwy 401 registers some of the biggest traffic in Ontario.
We could examine where ON0115 is located on the map.
data[data['stationID'] == 'ON0115'].locationDescription.unique()
Keele_to_MTO = pd.read_csv('''./mygeodata/On_Highway_401_from_Hwy_401_eastbound_collectors_west_of_Keele_St_to_MTO_Parking_Lot_north_of_Hwy_401_on_west_side_of_Keele_St.csv''')
#Add markers to the map
Keele = [Keele_to_MTO.loc[0, 'Y'], Keele_to_MTO.loc[0,'X']]
MTO_Lot = [Keele_to_MTO.loc[1, 'Y'], Keele_to_MTO.loc[1,'X']]
#Create the map
my_map = folium.Map(location = Keele, zoom_start = 13)
folium.Marker(Keele, popup = 'Keele Street').add_to(my_map)
folium.Marker(MTO_Lot, popup = 'MTO Parking Lot').add_to(my_map)
#Display the map
my_map.save('Keele_to_MTO.html')
my_map
This is located westbound from east of Keele Street to the MTO Parking lot north of Hwy 401. This is only a few kilometers north of downtown Toronto.
data[data['stationID'] == 'ON0116'].locationDescription.unique()
Station ON0116 monitors eastbound traffic.
data[data['stationID'] == 'ON0139'].locationDescription.unique()
Kennedy_to_Maint = pd.read_csv('''./mygeodata/On_Highway_401_from_Hwy_401_eastbound_collectors_west_of_Kennedy_Rd_to_Maintenance_Yard_north_of_Hwy_401_on_west_side_of_Kennedy_Rd.csv''')
#Add markers to the map
Kennedy = [Kennedy_to_Maint.loc[0, 'Y'], Kennedy_to_Maint.loc[0,'X']]
Maint_Yard = [Kennedy_to_Maint.loc[1, 'Y'], Kennedy_to_Maint.loc[1,'X']]
#Create the map
my_map = folium.Map(location = Kennedy, zoom_start = 12)
folium.Marker(Kennedy, popup = 'Kennedy Street').add_to(my_map)
folium.Marker(Maint_Yard, popup = 'Maintenance Yard').add_to(my_map)
#Display the map
my_map
Station ON0139 is just off the airport around Mississauga. This is also one of the areas with rather heavy traffic in the province.
The subplots showed that station ON0115 on Hwy 401 records some of the heaviest traffic. From the cross correlation matrix examined earlier, we observed that some other stations correlate strongly with station ON0115. Ranking these values in descending order shows that ON0026 and ON0027 are among the stations whose traffic correlates most highly with traffic at ON0115. This could mean they also register sizeable traffic. We could take a look.
xcorr_values.loc['ON0115'].sort_values(ascending=False)[:10]
The locationDescription of ON0026 shows that it is on the Queen Elizabeth Way (QEW).
data.locationDescription[data['stationID']=='ON0026'].unique()
QEW_to_Dorval = pd.read_csv('./mygeodata/On_QEW_between_Dorval_Dr_Exit_116_Third_Line_Exit_113_Km_Marker_112.csv')
QEW_to_Dorval
Dorval_Drive = [QEW_to_Dorval.loc[0, 'Y'], QEW_to_Dorval.loc[0,'X']]
Third_Line = [QEW_to_Dorval.loc[1, 'Y'], QEW_to_Dorval.loc[1,'X']]
#Add markers to the map
#Create the map
my_map = folium.Map(location = Dorval_Drive, zoom_start = 13)
folium.Marker(Dorval_Drive, popup = 'Dorval Drive').add_to(my_map)
folium.Marker(Third_Line, popup = 'Third Line').add_to(my_map)
#Display the map
my_map
This is very close to iFly Toronto, on the expressway that leads into Mississauga around Oakville. Little wonder it experiences a lot of traffic.
Is there a better alternative to the QEW for traffic-conscious drivers?
Just north of this expressway is another expressway leading into Mississauga and connecting to Trafalgar Road. It appears there is also a station here, so we can examine how heavy vehicular traffic is at this location.
data[data.locationDescription.str.contains('Trafalgar')].stationID.unique()
There are two stations here, ON0033 and ON0032. How heavy is the traffic compared to the station south of them, and compared to the station at ON0115?
(data.groupby('stationID')
     .mean(numeric_only=True)
     .loc[['ON0115', 'ON0032', 'ON0033', 'ON0026']]
     .iloc[:, 2:]).sort_values(by='totalVehicles', ascending=False)
Traffic north of ON0026 appears much lighter. This is probably because the route is tolled and is therefore avoided by most drivers; it seems a good alternative for the traffic-conscious driver. Station ON0115 on Hwy 401 remains heavy with traffic.
Now, we sort traffic by highway and arrange the results in order of decreasing mean traffic count.
The highways with the biggest traffic in Ontario appear to be Hwy 401, QEW, King Edward Ave, and Hwy 400.
Hwy 401 runs across the south of Ontario, from its western end near Detroit in the United States eastwards towards the Quebec border. It is also very busy because it carries traffic westwards and eastwards into Ontario's largest city, Toronto.
# Create pivot table which shows the mean number of vehicles on each highway
(pd.pivot_table(data=data, index='highway', values=['multi', 'auto', 'single',
'totalTrucks', 'totalVehicles'], aggfunc='mean')
.sort_values(by='auto', ascending=False)
.head(10))
We could also examine the percentage of vehicular traffic each highway experiences.
# Obtain the percentage vehicle volume for each highway as a fraction of total traffic in the province.
# This sheds light on each highway's contribution to provincial traffic.
highwayTraffic = (pd.pivot_table(data=data, index='highway', values=['multi', 'auto', 'single',
'totalTrucks', 'totalVehicles'], aggfunc='mean')
.sort_values(by='auto', ascending=False))
highwayTraffic['trafficPercentage'] = highwayTraffic['totalVehicles'] * 100 / highwayTraffic['totalVehicles'].sum()
highwayTraffic.sort_values(by='trafficPercentage', ascending=False)[:10]
ax = highwayTraffic[:40]['trafficPercentage'].plot(kind='bar', figsize=(15,6))
ax.set_xticklabels(highwayTraffic[:40].index,rotation=65)
ax.set_title('Percentage of Traffic per Highway', fontsize=20)
ax.set_ylabel('Percentage', fontsize=15)
ax.set_xlabel('Highway', fontsize=15)
plt.show()
Here, we are able to establish that just three highways in the province, Hwy 401, QEW and King Edward Ave, account for more than 50% of vehicular traffic in the province!
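The over-50% claim can be checked directly by sorting the per-highway shares and taking a cumulative sum. A sketch with hypothetical per-highway mean counts (highway names from the data set, values invented):

```python
import pandas as pd

# Hypothetical mean traffic per highway, for illustration only.
means = pd.Series({'Hwy 401': 3000, 'QEW': 1500, 'King Edward Ave': 900,
                   'Hwy 400': 600, 'Hwy 417': 400, 'Hwy 11': 100})

# Each highway's share of the total, then the running (cumulative) share.
share = means * 100 / means.sum()
cumulative = share.sort_values(ascending=False).cumsum()
print(cumulative.round(1))

top3 = cumulative.iloc[2]   # share covered by the three busiest highways
print(f'Top three highways carry {top3:.1f}% of the traffic')
```

On the real data, the same `cumsum()` over `highwayTraffic['trafficPercentage']` would verify how many highways it takes to cross the 50% mark.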
For Hwy 401, we look at the time series traffic data for each day of the week. This could help shed light on exactly how traffic changes each day of the week on this highway.
def plot_hourly_chart(ax, stations, dayWeekNo):
    '''
    Plot a line of hour against number of auto vehicles for each station
    on the given axes, for one day of the week.
    '''
    for station in stations:
        X = data[(data["stationID"] == station) & (data["dayWeekNo"] == dayWeekNo)]
        ax.plot(X["hour"], X["auto"], label=station)
    ax.legend(loc='upper left')
from matplotlib.pyplot import cm

stations = data[data['highway'] == 'Hwy 401'].stationID.unique()
plt.figure(figsize=[17, 27])
for i, dayWeekNo in enumerate(range(1, 8), start=1):
    ax = plt.subplot(4, 2, i)
    # One colour per station from a qualitative colormap
    color = iter(cm.tab20(np.linspace(0, 1, len(stations))))
    for station in stations:
        c = next(color)
        X = data[(data["stationID"] == station) & (data["dayWeekNo"] == dayWeekNo)]
        ax.plot(X["hour"], X["auto"], label=station, c=c)
    ax.set_title(dayWeekNo)
    plt.legend(loc='upper left')
Traffic appears to begin around 5 a.m. and increases gently during the day, peaking on the major highways around noon. Vehicle density does not seem to reduce until evening, around 6 p.m., and by midnight it drops significantly. On the last day of the week, many more cars appear to stay on the road into the evening (parties? nights out?), as vehicle density does not fall as sharply as on other days. The consistency of these plots suggests that the probability of getting caught in traffic trends higher as the morning progresses. To beat traffic, one would have to be on the road as early as 5-6 a.m.; leaving any later does not appear to help.
Quite remarkably, we see a gentle slope on both the first and last day of the week, and steeper slopes on all other days. Since days 1 and 7 are Sunday and Saturday, could this reflect more people relaxing at home on the weekend, compared to the busy working days in between? It would certainly be interesting if data from the other highways suggests the same.
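The slope differences can be quantified by fitting a line to the morning portion (05:00 to 12:00) of a mean hourly profile with `np.polyfit`. A sketch on two hypothetical profiles (all counts invented) standing in for a working day and a weekend day:

```python
import numpy as np

# Mean auto counts for hours 05:00-12:00 (hypothetical values).
hours = np.arange(5, 13)
weekday = np.array([200, 600, 1200, 1800, 2300, 2700, 3000, 3200])
weekend = np.array([200, 300, 450, 600, 800, 1000, 1200, 1400])

# Slope of the best-fit line, in vehicles per hour of the morning ramp.
weekday_slope = np.polyfit(hours, weekday, 1)[0]
weekend_slope = np.polyfit(hours, weekend, 1)[0]
print(f'weekday ramp: {weekday_slope:.0f} vehicles/hour')
print(f'weekend ramp: {weekend_slope:.0f} vehicles/hour')
```

On the real frame, the profiles could come from something like `data.groupby(['dayWeekNo', 'hour'])['auto'].mean()` (assuming the column names used in this notebook), giving one slope per day of week to compare.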
from matplotlib.pyplot import cm

stations = data[data['highway'] == 'QEW'].stationID.unique()
plt.figure(figsize=(8, 6))
for i, dayWeekNo in enumerate(range(1, 8), start=1):
    ax = plt.subplot(4, 2, i)
    color = iter(cm.tab20(np.linspace(0, 1, len(stations))))
    for station in stations:
        c = next(color)
        X = data[(data["stationID"] == station) & (data["dayWeekNo"] == dayWeekNo)]
        ax.plot(X["hour"], X["auto"], label=station, c=c)
    ax.set_title(dayWeekNo)
The data from the QEW appears to suggest the same pattern.
from matplotlib.pyplot import cm

stations = data[data['highway'] == 'King Edward Ave'].stationID.unique()
plt.figure(figsize=(8, 6))
for i, dayWeekNo in enumerate(range(1, 8), start=1):
    ax = plt.subplot(4, 2, i)
    color = iter(cm.tab20(np.linspace(0, 1, len(stations))))
    for station in stations:
        c = next(color)
        X = data[(data["stationID"] == station) & (data["dayWeekNo"] == dayWeekNo)]
        ax.plot(X["hour"], X["auto"], label=station, c=c)
    ax.set_title(dayWeekNo)
Similarly, data from King Edward Ave seems to suggest the same: there is clearly a steep slope on days 2 to 6, and gentler slopes on days 1 and 7.
Now, we can examine just how significant the traffic is on each highway in the data set. We already know that Hwy 401 carries the heaviest traffic in the province. But what about the other highways?
ax = (pd.pivot_table(data=data, index='highway', values=['multi', 'auto',
'single'], aggfunc='mean')
.sort_values(by='auto', ascending=False)
.head(10)).plot(kind='bar', figsize=(15,8))
labels = ax.get_xticklabels()
ax.set_xticklabels(labels, rotation=45)
ax.set_title('Vehicle Count per Highway ', fontsize=20)
ax.set_ylabel('Vehicle Count', fontsize=15)
ax.set_xlabel('Highway', fontsize=15)
ax.legend(fontsize=15)
plt.show()
Again, auto makes up the highest vehicle count on every highway. It would seem there are many more passenger cars in Ontario than there are trucks. However, are there areas in the province where trucks dominate?
We examine again the distribution of vehicular traffic for each highway and sort by totalVehicles and totalTrucks. This time we express totalTrucks as a fraction of totalVehicles to obtain truckPercentage for each highway.
highwayTraffic = (data
                  .groupby(by='highway')
                  .mean(numeric_only=True)
                  .sort_values(by=['totalVehicles', 'totalTrucks'],
                               ascending=False)[['single', 'multi', 'auto', 'totalTrucks', 'totalVehicles']])
highwayTraffic['truckPercentage'] = highwayTraffic['totalTrucks'] * 100 / highwayTraffic['totalVehicles']
highwayTraffic.sort_values(by='truckPercentage', ascending=False)[:10]
ax = highwayTraffic.sort_values(by='truckPercentage',
ascending=False)['truckPercentage'][:10].plot(kind='bar', figsize=(15,8))
labels = ax.get_xticklabels()
ax.set_xticklabels(labels, rotation=70)
ax.set_title('Proportion of Truck Traffic per Highway ', fontsize=20)
ax.set_ylabel('Proportion of Truck Traffic', fontsize=15)
ax.set_xlabel('Highway', fontsize=15)
plt.show()
The areas with the heaviest truck traffic appear to be Hwy 3 - Huron Church Rd, Hwy 402, and Hwy 405. Even though Hwy 401 and the QEW carry the heaviest overall traffic in Ontario, trucks account for only a small share of it.
These areas with high truck traffic are located on the maps below.
data[data.highway == 'Hwy 3 - Huron Church Rd'].locationDescription.unique()
Amb_bridge_to_Toll = pd.read_csv('./mygeodata/On_Blue_Water_Bridge_Plaza_after_toll_booths_entering_Michigan.csv')
Amb_bridge_to_Toll
Blue_Water_Bridge = [Amb_bridge_to_Toll.loc[0, 'Y'], Amb_bridge_to_Toll.loc[0,'X']]
#Add markers to the map
#Create the map
my_map = folium.Map(location = Blue_Water_Bridge, zoom_start = 17)
folium.Marker(Blue_Water_Bridge, popup = 'Blue Water Bridge').add_to(my_map)
#Display the map
my_map
data[data.highway == 'Hwy 402'].locationDescription.unique()
Oil_Heritage_to_Town_Line = pd.read_csv('./mygeodata/On_Highway_402.csv')
Oil_Heritage = [Oil_Heritage_to_Town_Line.loc[0, 'Y'], Oil_Heritage_to_Town_Line.loc[0,'X']]
Town_Line = [Oil_Heritage_to_Town_Line.loc[1, 'Y'], Oil_Heritage_to_Town_Line.loc[1,'X']]
# Create the map centered on Oil Heritage Road
my_map = folium.Map(location = Oil_Heritage, zoom_start = 12)
# Add markers to the map
folium.Marker(Oil_Heritage, popup = 'Oil Heritage Road').add_to(my_map)
folium.Marker(Town_Line, popup = 'Town Line').add_to(my_map)
#Display the map
my_map
These highways with heavy truck traffic are located close to border toll stations in Ontario. This suggests that the proportion of trucks relative to cars is higher around Ontario's borders than inside the province itself. Evidently, this supports the suspicion of significant commercial activity at the Canada-US border.
Next, we examine the directions in which most traffic is recorded in the province.
data.pivot_table(index="direction", columns=["dayWeekNo"], values=["totalVehicles"], aggfunc="mean")
data.pivot_table(index="direction", columns=["dayWeekNo"], values=["totalVehicles"], aggfunc="mean").T
ax = data.pivot_table(index="direction", columns=["dayWeekNo"],
values=["totalVehicles"], aggfunc="mean").T.plot(figsize=(10,7))
ax.set_ylabel('Mean Traffic count per day', fontsize=15)
ax.set_xlabel('Day of Week', fontsize=15)
ax.set_title('Directions of Vehicle Traffic', fontsize=20)
ax.set_xticklabels(range(1,8),rotation=0)
ax.legend(fontsize=15)
None
The highest traffic is eastbound, in the direction of downtown Toronto from areas west of the GTA (as the earlier maps showed, the stations which registered this traffic are located west of Toronto). This is no surprise, as the central business district lies in this direction. Less traffic flows westbound, and the least traffic moves in the northbound and southbound directions.
This traffic distribution also suggests that those living north or south of the city centre may experience less traffic than those living east or west of it. This relies heavily on the assumption that most traffic heads towards the city centre.
Our hypothesis here would be "Central", since that is where Toronto belongs. But we can check the data.

The different MTO regions are displayed above. Source: MTO
data.pivot_table(index="MTORegion", columns=["dayWeekNo"], values=["totalVehicles"], aggfunc="mean")
ax = data.pivot_table(index="MTORegion", columns=["dayWeekNo"],
values=["totalVehicles"], aggfunc="mean").T.plot(figsize=(10,7))
ax.set_ylabel('Mean Traffic count per day', fontsize=15)
ax.set_xlabel('Day of Week', fontsize=15)
ax.set_title('MTO Region', fontsize=20)
ax.set_xticklabels(range(1,8),rotation=0)
ax.legend(fontsize=15)
None
For every single day of the week, the Central MTO region records the highest traffic. This is no surprise, as Toronto, the province's financial centre, lies within it. The Central region has a higher mean traffic count per day than any other region. Expressed as percentages, we see in the dataframe below that Central contributes nearly 70% of the traffic experienced each day in the province. This is very significant indeed. Southwestern comes a distant second.
We should be able to show easily that this traffic consists mostly of autos.
round(data.pivot_table(index="MTORegion", columns=["dayWeekNo"], values=["totalVehicles"], aggfunc="mean") * 100 \
/ data.pivot_table(index="MTORegion", columns=["dayWeekNo"], values=["totalVehicles"], aggfunc="mean").sum(axis=0), 3)
data.groupby('dayWeekNo').mean().iloc[:, 1:4]
The mean traffic per day is almost equal for autos. For trucks, with the exception of Monday and Sunday, the mean traffic also looks fairly similar.
round((((data.groupby('dayWeekNo').mean()).iloc[:, 1:4]).T * 100/
(data.groupby('dayWeekNo').mean()).iloc[:, 1:4].sum(axis=1)).T, 2)
ax = round((((data.groupby('dayWeekNo').mean()).iloc[:, 1:4]).T * 100/
(data.groupby('dayWeekNo').mean()).iloc[:, 1:4].sum(axis=1)).T, 2).plot(kind='bar', figsize=(16,7))
ax.legend(loc='upper left')
ax.set_title('Traffic Proportion per Day of the Week', fontsize=20)
ax.set_xlabel('Day of the Week', fontsize=15)
ax.set_ylabel('Traffic Proportion', fontsize=15);
Expressed as percentages, the traffic for each day is almost equal. With the exception of the first and last days of the week, where traffic is only slightly higher, traffic on the other days looks much the same. Truck traffic, however, is significantly lower on these first and last days of the week than on the other days, probably because most commercial business happens in the thick of the weekday.
In this case the raw traffic count offers no more insight than the percentage count.
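As an aside, the row-wise percentage computed above can be expressed more directly with pandas' DataFrame.div. A minimal sketch on a synthetic frame standing in for the groupby result (the numbers here are made up):

```python
import pandas as pd

# Synthetic stand-in for the per-day means computed above
means = pd.DataFrame({'auto': [900.0, 950.0],
                      'totalTrucks': [100.0, 50.0]},
                     index=[1, 2])
# Divide each row by its row sum to obtain row-wise percentages
pct = means.div(means.sum(axis=1), axis=0) * 100
```

This avoids the double transpose used in the cells above while producing the same row-wise percentages.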
Our suspicion this time would be the peak hours of the day, between 11 in the morning and 4 in the afternoon. But let us take a look at the data.
ax = round((data.groupby('hour').mean()).iloc[:, 1:4], 2).plot(kind='bar', figsize=(16,7))
ax.legend(loc='upper left')
ax.set_title('Traffic Size per Hour of the Day', fontsize=20)
ax.set_xlabel('Hour of the Day', fontsize=15)
ax.set_ylabel('Traffic Size', fontsize=15);
The raw counts show a significant decrease in traffic during the early morning hours between midnight and 5am. As the morning kicks off, traffic starts to increase, before slowly decreasing towards evening. The highest-traffic hours are in fact between 3pm and 6pm, apparently around closing time. While it may look like totalTrucks traffic is just as significant as auto traffic during the day, the percentage plot below offers a completely different insight.
round((((data.groupby('hour').mean()).iloc[:, 1:4]).T * 100/
(data.groupby('hour').mean()).iloc[:, 1:4].sum(axis=1)).T, 2)
sns.set()
ax = round((((data.groupby('hour').mean()).iloc[:, 1:4]).T * 100/
(data.groupby('hour').mean()).iloc[:, 1:4].sum(axis=1)).T, 2).plot(kind='bar', figsize=(16,7))
ax.legend(loc='upper left')
ax.set_title('Traffic Proportion per Hour of the Day', fontsize=20)
ax.set_xlabel('Hour of the Day', fontsize=15)
ax.set_ylabel('Traffic Proportion', fontsize=15);
With this percentage plot we see that the percentage of autos on the roads decreases somewhat in the morning hours. During this time the percentage of totalTrucks traffic actually increases significantly, confirming that most trucks move during the early morning hours between 1am and 5am. Clearly, a higher fraction of trucks move during the early morning hours than during the daytime. For the most part, however, autos (passenger vehicles) still dominate traffic.
We attempt to scale and cluster the data set in order to gain some additional insights.
# import required libraries
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
# Assign the numerical columns single, multi, auto and totalTrucks to X
# Assign labels to Y. Labels contain stationID.
X = data[['single', 'multi', 'auto', 'totalTrucks']]
Y = data['stationID']
# scale input X values to remove influence of larger numerical values like auto
# on the machine learning algorithm
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
#set color per stationID
x = np.arange(0,104)
x1 = np.repeat(x, 168, axis=0)
# Sample the stations at every 168th index to obtain the unique numeric value
# corresponding to each station
Y.loc[1:len(Y):168].head()
# Apply kmeans to generate 10 clusters.
# The choice of 10 clusters is arbitrary; we could as well use 4.
# This depends on how finely we intend to grade traffic density.
kmeans = KMeans(n_clusters=10)
kmeans.fit(X_scaled)  # fit on the scaled values, consistent with the scaling above
y_kmeans = kmeans.predict(X_scaled)
y_kmeans[:40]
# concatenate labels with output from clustering algorithm
pd.concat([Y,pd.Series(y_kmeans)],axis=1).head()
X.head()
First, we plot totalTrucks data on the y-axis against the other attributes and attempt pairwise comparison for all the groups.
fig, ax = plt.subplots(2, 2, figsize=(12,9))
for i, attrb in enumerate(['single', 'multi', 'auto']):
    ax[i//2, i%2].scatter(X[attrb], X['totalTrucks'])
    ax[i//2, i%2].set(xlabel=attrb,
                      ylabel='totalTrucks',
                      title=attrb + ' vs. totalTrucks')
    # Reference line for comparing the scatter against a fixed slope
    x = np.linspace(0, max(X[attrb]), 50)
    y = np.linspace(0, 1600, 50)
    ax[i//2, i%2].plot(x, y, linewidth=3, linestyle="-.", color='red')
plt.tight_layout()
totalTrucks is the sum of single- and multi-unit trucks. The scatter plot appears to suggest that when there are fewer trucks on the road, they are mostly multi-unit trucks: the multi vs. totalTrucks scatter has nearly a slope of 1 at lower values, meaning the number of totalTrucks is almost the same as the number of multi. One can therefore infer that there are more multi-unit trucks on Ontario roads than single-unit trucks.
The auto vs. totalTrucks scatter plot offers better insight. Clearly, there are many more autos than other vehicles. In areas of high auto density, totalTrucks traffic appears low: there are more scatter points on the south side of the plot than on the north end. Again, we see that overall there are few locations where traffic is dense; most traffic in the province is light, as seen in the clustering of most points at low values of auto and totalTrucks.
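A quick numeric check of the inference about multi-unit trucks: compute the mean share of totalTrucks contributed by multi. The small frame below is a synthetic stand-in with the same column names as the data set (the counts are made up):

```python
import pandas as pd

# Hypothetical truck counts, using the data set's column names
toy = pd.DataFrame({'single': [5, 12, 3, 20],
                    'multi': [15, 30, 9, 45]})
toy['totalTrucks'] = toy['single'] + toy['multi']
# Mean fraction of totalTrucks contributed by multi-unit trucks
multi_share = (toy['multi'] / toy['totalTrucks']).mean()
# A value above 0.5 supports "more multi-unit than single-unit trucks"
```

On the real data, the same two lines applied to the `data` frame would quantify what the scatter plot only suggests visually.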
Before applying k-means clustering, we first generate a scatter plot in which the points are coloured by stationID. Areas of low traffic are clustered together, while areas of high traffic are spread a little apart. Many of the low-traffic areas overlap each other, and areas of high traffic do not appear to interfere significantly with areas of lower traffic.
Areas of high traffic also seem to have totalTrucks and auto in near-equal measure. This could suggest that passenger vehicles and trucks contribute nearly equal traffic in the province.
Note that the colours will be scattered about the plot, because areas of low traffic are mixed in with areas of high traffic. The plot has been coloured per stationID.
plt.figure(figsize=(10,8))
plt.scatter(X_scaled[:,2], X_scaled[:,3], c=x1, cmap='viridis')
plt.xlabel('auto', fontsize=15)
plt.ylabel('totalTrucks', fontsize=15)
plt.title('auto vs. totalTrucks', fontsize=20)
plt.show()
Below, we show the scatter plot coloured by cluster number. k-means appears to have done an excellent job of clustering the data set into regions ranging from very low to rather high traffic: lower-traffic regions sit close to the origin, while higher-traffic regions lie away from it. We could proceed from here by applying a label to each record, which could then be used in a classification task to predict traffic volume when given a new data set.
plt.figure(figsize=(10,8))
plt.scatter(X_scaled[:,2], X_scaled[:,3], c=y_kmeans, cmap='viridis')
plt.xlabel('auto', fontsize=15)
plt.ylabel('totalTrucks', fontsize=15)
plt.title('auto vs. totalTrucks', fontsize=20)
plt.show()
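A minimal sketch of that follow-up idea, on synthetic two-dimensional data rather than the actual traffic records: the k-means cluster ids serve as labels for a supervised classifier (the choice of k-nearest neighbours here is just an illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated synthetic blobs standing in for low/high traffic regions
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 0.05, size=(50, 2)),
                   rng.normal(1.0, 0.05, size=(50, 2))])
# Cluster the data, then use the cluster ids as training labels
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_toy)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_toy, labels)
# New points near each blob should receive that blob's label
pred = clf.predict(np.array([[0.0, 0.0], [1.0, 1.0]]))
```

With the real data, X_toy would be replaced by the scaled traffic features and labels by the 10-cluster output above.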
Next, we attempt to forecast traffic on the busiest highway, Hwy 401, using the available time series data. We will use a simple FFT-based method for this task, available in NumPy.
data[['single', 'multi', 'auto', 'totalTrucks']][data['stationID']=='ON0115'].plot(figsize=(10,8))
None
# We will attempt to predict 100 units into the future using only the
# auto attribute
auto = data[['single', 'multi', 'auto', 'totalTrucks']][data['stationID']=='ON0115'].auto
auto_predict = 100
n = auto.size
harmonics = 10
t = np.arange(0,n)
p = np.polyfit(t, auto, 1) # generate linear coefficients for the time data
auto_notrend = auto - p[0] * t # remove the linear trend (slope term) to create a detrended series
auto_freqdom = np.fft.fft(auto_notrend) #Apply fft
f = np.fft.fftfreq(n)
# Sort frequency indices by absolute frequency so the lowest frequencies come first
indexes = sorted(range(n), key = lambda i: np.absolute(f[i]))
plt.figure(figsize=(10,8))
t = np.arange(0, n+auto_predict)
new_auto = np.zeros(t.size)
for i in indexes[:1 + harmonics * 2]: # DC term plus the lowest +/- frequency pairs
    ampli = np.absolute(auto_freqdom[i]) / n
    phase = np.angle(auto_freqdom[i])
    new_auto += ampli * np.cos(2 * np.pi * f[i] * t + phase) # Rebuild each wave from amplitude and phase
new_auto = new_auto + p[0] * t # Add back the linear trend
plt.plot(np.arange(0, new_auto.size), new_auto, 'r', label='Prediction') # Create plots from here
plt.plot(np.arange(0, auto.size), auto, 'b', label='Original', linewidth = 1)
plt.xlabel('Time', fontsize=15)
plt.ylabel('Auto Traffic Density', fontsize=15)
plt.legend(fontsize=15, loc='upper left')
plt.title('Prediction of Auto Vehicular Traffic using fft', fontsize=20)
plt.show()
It is not unexpected that we obtain a repetition of past values, since the method extends the dominant periodic components forward. The FFT seems quite suitable for this prediction: vehicular traffic will almost always follow this pattern, except on unusual days when other external factors affect it.
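The steps above can be wrapped into a small reusable function. This is a sketch of the same detrend-plus-harmonics idea, not tied to the traffic data; the function name and defaults are ours, and the toy signal below stands in for the auto series:

```python
import numpy as np

def fourier_extrapolate(series, n_predict, n_harmonics=10):
    """Extend a 1-D series n_predict steps ahead using its lowest-frequency
    FFT harmonics: detrend linearly, keep the dominant harmonics, rebuild."""
    series = np.asarray(series, dtype=float)
    n = series.size
    t = np.arange(n)
    slope = np.polyfit(t, series, 1)[0]        # linear trend (slope only)
    detrended = series - slope * t
    freq_dom = np.fft.fft(detrended)
    freqs = np.fft.fftfreq(n)
    # Lowest absolute frequencies first: DC term, then the slow harmonics
    order = sorted(range(n), key=lambda i: np.absolute(freqs[i]))
    t_ext = np.arange(n + n_predict)
    rebuilt = np.zeros(t_ext.size)
    for i in order[:1 + 2 * n_harmonics]:      # each harmonic has +/- freqs
        ampli = np.absolute(freq_dom[i]) / n
        phase = np.angle(freq_dom[i])
        rebuilt += ampli * np.cos(2 * np.pi * freqs[i] * t_ext + phase)
    return rebuilt + slope * t_ext             # restore the linear trend

# Toy check: a daily-style cycle (period 24 hours) over 8 full "days"
t = np.arange(192)
signal = 10 + 3 * np.cos(2 * np.pi * t / 24)
extended = fourier_extrapolate(signal, n_predict=48)
```

On a purely periodic signal like this, the extrapolated segment simply continues the cycle, which is the behaviour observed in the Hwy 401 plot above.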
Now we would like to examine the chances of being stuck in traffic, or more precisely, the chance of being stuck in traffic on a particular highway. With the volume of data we have, this appears to be an analysis we can conduct. We will use logistic regression for this, attempting to obtain an average figure.
data.head()
The histograms of the log transformations of the data set obtained earlier suggest a way to classify the data. The log transformation is more appropriate here, as the original data set is clustered at one side of the histogram. Since we do not have a label, we will have to create one. For simplicity, we will not use the results from the 10 clusters created earlier; instead we create just two classes, 'High Traffic' (1) and 'Low Traffic' (0). We use the auto values to label the available data and then train our model with the rest of the data set.
data1 = data.copy()
data1 = np.log(data1[['single','multi','auto','totalTrucks','totalVehicles']] + 1)
sns.set(font_scale = 3)
melt_data = data1[['single','multi','auto','totalTrucks','totalVehicles']].melt()
g = sns.FacetGrid(melt_data, col="variable", col_wrap= 2, sharey=False, sharex=False, height=10, aspect=2)
g = g.map(sns.distplot, "value", kde=True)
data1.describe()
Here, we divide the data by the auto column: records where the log-transformed auto value exceeds 5.05 (roughly the median, as the cell below shows) are labelled high traffic, and the rest low traffic.
#Obtain indices for rows in data set where transformed value of auto is greater than 5.05
data1[data1['auto'] > 5.05].index
len(data1[data1['auto'] <= 5.05].index) #Just about half of the available data set.
xdata = data.copy()
xdata['Traffic'] = data1['auto'].apply(lambda x: 1 if x > 5.05 else 0) #Class traffic into high or low
xdata.head()
# Divide data into training and test sets.
# (X_data and y_data are defined in the scikit-learn cell further below; run that cell first.)
n = xdata.sample(frac=0.7, random_state=123).index
X_data_train = X_data.iloc[n, :]
y_data_train = y_data.iloc[n]
X_data_test = X_data.iloc[np.asarray(
[item
for item in xdata.index
if item not in n]), :]
y_data_test = y_data.iloc[np.asarray(
[item
for item in xdata.index
if item not in n])]
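As a side note, the manual index bookkeeping above can be done in one call with scikit-learn's train_test_split. A sketch on a synthetic frame (the column names mirror the cells above; the counts and the label rule are made up):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature frame
rng = np.random.default_rng(123)
toy = pd.DataFrame({'single': rng.poisson(30, size=1000),
                    'multi': rng.poisson(60, size=1000)})
# Hypothetical binary label, analogous to the Traffic column built above
toy['Traffic'] = (toy['single'] + toy['multi'] > 90).astype(int)

# 70/30 split in one call, reproducible via random_state
X_tr, X_te, y_tr, y_te = train_test_split(toy[['single', 'multi']],
                                          toy['Traffic'],
                                          train_size=0.7,
                                          random_state=123)
```

The list-comprehension complement used above works, but train_test_split keeps features and labels aligned automatically.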
# Prepare X data for regression: add a constant column for the intercept
X_data_train = X_data_train.copy()
X_data_train['intercept'] = 1
X_data_train.head()
import statsmodels.api as sm
'''Perform the fit with the Logit method.'''
logit_model = sm.Logit(y_data_train.values, X_data_train.values)
fit_logit = logit_model.fit()
fit_logit.summary()
The coefficient for single appears to be statistically significant given its low p-value; the coefficient for multi, however, is not. The pseudo R-squared value also appears a little low.
We use sklearn next to obtain a fit and then predict with the test data. Since we used the auto column to prepare the classes, we will take it out of the dataset.
from sklearn.linear_model import LogisticRegression
X_data = data1[['single', 'multi']]
y_data = xdata['Traffic'] # the Traffic label was added to xdata, not data1
# Create logistic regression object
fit_lg = LogisticRegression(C=1e9, solver='liblinear')
#Conduct fit
fit_lg.fit(X_data_train, y_data_train)
#Conduct prediction
y_pred = fit_lg.predict(X_data_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_pred, y_data_test) #accuracy score
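Accuracy alone can hide class imbalance, so a confusion matrix is worth a look. A toy example with made-up labels standing in for y_data_test and y_pred:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels (0 = low traffic, 1 = high traffic)
y_true_toy = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 1, 1, 1, 0, 0, 1, 0]
# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
```

Applied to the real y_data_test and y_pred, this would show whether the model's errors fall mostly on the high-traffic or the low-traffic class.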
proba = fit_lg.predict_proba(X_data_test)
proba
X_data_test1 = data1.iloc[np.asarray(
[item
for item in xdata.index
if item not in n]), :]
# Add some jitter to the probability values
jitter = np.random.randn(len(y_pred)) * 0.004 * (max(y_pred) - min(y_pred))
jitter
#Here we plot the regression curves.
#We add little jitter to the plot to reduce the effect of overplotting
sns.set()
plt.figure(figsize=(7,5))
plt.scatter(X_data_test['single'], y_pred+jitter, alpha=0.8, s=15)
plt.scatter(X_data_test1['single'], proba[:,1:].ravel(), s=1)
plt.scatter(X_data_test1['single'], proba[:,:1].ravel(), s=1)
plt.xlabel('Single Unit', fontsize=18)
plt.ylabel('Traffic Probability', fontsize=18);
fit_lg.coef_, fit_lg.intercept_ # We obtain essentially the same coefficients and intercept as with statsmodels
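For reference, predict_proba for logistic regression is just the sigmoid of the linear combination of the coefficients and the intercept. A quick check with hypothetical numbers (the coefficients below are made up, not the fitted ones):

```python
import numpy as np

def sigmoid(z):
    # Logistic function mapping a linear score to a probability
    return 1.0 / (1.0 + np.exp(-z))

coef = np.array([0.05, 0.02])    # hypothetical coefficients for single, multi
intercept = -3.0                 # hypothetical intercept
x_row = np.array([40.0, 30.0])   # one observation: single=40, multi=30
p_high = sigmoid(intercept + coef @ x_row)  # P(high traffic) for this row
```

Substituting fit_lg.coef_ and fit_lg.intercept_ here would reproduce the second column of predict_proba.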
ydata_ = data1.iloc[np.asarray(
[item
for item in xdata.index
if item not in n])]
ydata_.head()
ydata = ydata_.reset_index(drop=True)
ydata.head()
#We examine the prediction against actual
ydata['prediction'] = y_pred
ydata['proba'] = proba[:,1:].ravel()
highway = data.iloc[ydata.index].reset_index(drop=True).highway
ydata['highway'] = highway
ydata.head()
#Display of the different probabilities
ydata.groupby('highway').mean()[['proba']].sort_values(by='proba', ascending=False)
Above, we have some idea of the chances of being stuck in traffic on the different highways in the test set. These values are mean probabilities, and with an overall accuracy of about 78% from the model, we can place some trust in the results. Clearly, we confirm our suspicion that Hwy 401 remains problematic: it registers as the highway on which one is most likely to get stuck in traffic.
Also, some of the highways with low probabilities do indeed appear very free compared to those with higher probabilities. Looking at traffic variations over the day on these highways confirms, for example, that Hwy 138 (with a probability of 0.199) rarely experiences heavy traffic. Hwy 401, on the other hand, shows intense traffic on weekdays, especially at peak hours.
Traffic on Hwy 3 - Huron Church Road is shown below. This highway is close to the border with the US and apparently has the biggest proportion of truck traffic in Ontario. The data suggests the chance of getting stuck in traffic here is quite high, and the gif below, taken on a Monday from 0000hrs to 2359hrs, supports this. Green colours represent fast traffic, orange medium and red intense traffic. Because it starts at midnight, the gif progresses from green to orange and then red as traffic increases. On Hwy 3, we see especially high traffic just around the border of the two countries, and progressively heavier traffic as the day goes by.

Another gif taken below of Hwy 401 shows just how heavy traffic can get on this highway. This gif, again taken on a Monday from 0000hrs to 2359hrs, shows the intensity of traffic on the highway, especially south of Keele Street (bright red on map) where station ON0115 from our data set is located, validating our earlier analysis that this station records the biggest count of vehicle movement in the province. The time lapse shows that traffic is quite high in the morning hours, but is much worse in the afternoon. This is just as we saw in the available dataset. Without doubt, one has the highest chance of getting stuck in traffic at this location. Our probability value of 0.87 seems very reasonable.

The final gif below is from Hwy 138, where we obtained a probability of getting stuck in traffic of about 0.199, a very low value. Traffic recorded here is near zero. The highway is located south-east of Ottawa, near Warina. It appears to be one of the quietest highways in the province, as revealed both in the available data and in this traffic gif.

All of these gifs were taken from Google Maps and reflect the current traffic situation in the province, whereas our data set is from 2008. The similarity between our analysis and this recent data suggests that, at least, the traffic situation in the province has not changed very much over time. This is a large claim and will certainly require examining more recent data to confirm; otherwise we risk conclusions with far-reaching consequences for the province's development and possibly its economy. While it is unlikely that the volume of cars has not increased over the years, it is quite possible that the patterns of traffic have not changed, especially if little has changed in the development layout of the province with respect to Toronto.
After examining the available data set, we derived some key insights. Some of the heaviest traffic in the province occurs on Hwy 401, the QEW and King Edward Ave. While these carry the highest density of cars and other passenger vehicles, they do not feature among the highways which carry the most truck traffic. The highways with the biggest truck traffic are much closer to the border of the province than to the city centre. We also found that passenger vehicles constitute the biggest share of traffic in the province: more passengers move within the province than goods. While most traffic heads towards the province's financial centre of Toronto, areas outside Toronto also see traffic that is just as significant, for example the QEW around Hamilton. The highest traffic, however, is eastbound. This could also be due to the location of the city's financial district relative to other locations. The northbound and southbound directions carry the least traffic.
We also found something curious about the time plot for Hwy 401. Though traffic begins to increase at around 5am and peaks at noon, on the first and last days of the week the slopes are much gentler than on the other days. Could this suggest some lethargy preventing people from hurrying to work on a Monday, and more people relaxing at home on the weekend compared to the busy days of the week? The other dense highway traffic, on the QEW and King Edward Ave, also follows this trend. Traffic on the last day of the week also did not peter out in the evening as it does on other days, possibly suggesting that more people spent longer nights out on the last day of the week.
Clustering offers an avenue to split the data set into a spectrum ranging from minimal to high traffic based on the traffic volume for any given hour of the day. This generates an output that could help label the data set, with which further classification could be done when a new data set is provided. We went on to predict future traffic on the biggest highway, Hwy 401, using the FFT. The easy predictability of traffic patterns makes the FFT a suitable method for this job, as it can easily reproduce the repetitive changes characteristic of traffic.
Finally, we used logistic regression to examine the chances of getting stuck in traffic on a particular highway. With the generated model we reinforced our earlier conclusion that Hwy 401 carries the biggest vehicle movement and offers the greatest chance of being stuck in traffic, with a probability as high as 87%. We also saw that some other highways, like Hwy 138, offer the lowest likelihood of traffic. Sample gifs obtained from Google Maps helped verify our analysis. Comparing our gifs, obtained in 2020, to results from a 2008 data set shows an eerie similarity, suggesting that traffic patterns in the province may not have changed greatly over the past years.
Unfortunately, some of the conclusions derived from this data are limited in reach, because the data set covers only two weeks and dates from 2008. Even though it is dense enough to yield quality insights, it is limited in span; had it covered several months or years, it would certainly have enabled better insights. The data set also provides no information on construction activities that may have been ongoing while the stations collected traffic counts, so we are left to speculate on the cause of certain traffic patterns. We have attributed all traffic to high vehicle density, and it is natural to suspect that this was not always the case.
Finally, the absence of GPS coordinates in the original data set greatly limited what could be achieved in the analysis. A good recommendation would be to include them in future releases of data sets of this kind, as GPS coordinates are among the most valuable attributes for deeper traffic analysis.
Ontario MTO (2019). 2006 Commercial Vehicle Survey: traffic volumes at survey stations (data set available https://data.ontario.ca/dataset/2006-commercial-vehicle-survey-traffic-volumes-at-survey-stations).
Jake VanderPlas (2016). Python Data Science Handbook (available at https://jakevdp.github.io/PythonDataScienceHandbook/). O'Reilly.